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^ ■ Abstract 

o: 

. Dimensionality is a major concern in analyzing large data sets. Some well known 

dimension reduction methods are for example principal component analysis (PC A), 
invariant coordinate selection (ICS), sliced inverse regression (SIR), sliced average vari- 
ance estimate (SAVE), principal hessian directions (PHD) and inverse regression es- 
■ timator (IRE). However, these methods are usually adequate of finding only certain 

types of structures or dependencies within the data. This calls the need to combine 
information coming from several different dimension reduction methods. We propose 
a generalization of the Crone and Crosby distance, a weighted distance that allows to 
combine subspaces of different dimensions. Some natural choices of weights are con- 
sidered in detail. Based on the weighted distance metric we discuss the concept of 
averages of subspaces as well to combine various dimension reduction methods. The 
performance of the weighted distances and the combining approach is illustrated via 
. simulations. 

l/^ I Keywords: Dimension reduction; Distance; Metric; Principal component analysis; 

t^;^ ■ Projection pursuit; Subspace. 

(N 
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(N '■ 1 Introduction 

Dimension reduction plays an important role in high dimensional data analysis. One then 
^ ■ wishes to reduce the dimension of a p-variate random vector x = (xi, . . . , Xp)' using a trans- 
^ . formation z = B'x where the transformation matrix B is a. p x k matrix with linearly 

independent columns, k < p. The column vectors of B then span the A;- dimensional subspace 
of interest. The transformation to the subspace can also be done using the corresponding 
p X p orthogonal projector Pb = B{BB')~^B' . The transformation z = Pbx projects the 
observations to a linear fc-variate subspace. 

There are two main types of dimension reduction - unsupervised and supervised. Princi- 
pal component analysis (PGA) is perhaps the best known unsupervised dimension reduction 
method. PCA finds an orthogonal transformation matrix in such a way that the components 
in the new coordinate system are uncorrelated and ordered according to their variances. In 
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dimension reduction, only the k components with highest variances are taken. Indepen- 
dent component analysis (ICA) is another example of an unsupervis ed dimension reduction 
method. The fourth-order blind identification (FOBI) ( Cardosdll989[ ) procedure then finds a 
transformation matrix in such a way that the new components are uncorrelated with respect 
to two distinct scatter matrices, the regular covariance matrix and a scatter matrix based 
on fourth moments, and ordered according to their kurtosis values. Then the k components 
with kurtosis values most deviating from that of a normal distribution are most interesting. 

In dimension reduction, the goal is often to use the transformed (reduced) variables to 
predict the value of a known response variable y. In supervised dimension reduction, the 
joint distribution of x and y is then taken into consideration in the dimension reduction of 
X, and it is hoped that y i x\B'x. Sliced inverse regression (SIR) ( lLilll99ll ) is a well known 
supervised dimension reduction method. It is, along with ICA, based on the comparison 
of two scatter matrices, where the second scatter matrix depends - unlike in unsupervised 
dimension reduction methods - on the joint distribution of x and y. Other well known dimen- 
sion reduction methods are for example the sliced average variance estimate (SAVE) and the 



principal hessian directions (PHD) (see Cook and Weisberg 1991 : Li 19921 respectively). A 



recent contribution to the field of sup ervised dimension reduction methods is the supervised 
invariant coordinate sel ection (SICS) ([Liski et al.ll2012l) . It is an extension of invariant coor- 
dinate selection (ICS) ( iTyler et al.ll2009l ). Both ICS and SICS are based on the comparison 
of two scatter matrices 5*1 and 5*2, and the transformation matrix is based on the eigenvalues 
and eigenvectors of In SICS however, the second scatter matrix depends on the joint 

distribution of x and y. 

Individual dimension reduction methods are usually adequate to find only special types 
of subspaces or special relationships between x and y. For example, SIR works well for lin- 
ear relationships but ma.y completely fail for other type of dependencies (see for example 
Cook and Weisberglll99ll ). Hence, there is a need for new approaches, which would use the 



good properties of individual dimension reduction methods and combine the information in 
an efficient way. The aim of this paper is to combine different dimension reduction methods 
in such a way that the "best qualities" of each method are picked up. To combine different 
dimension reduction methods is to say that we combine the individual orthogonal projectors 
possibly with various ranks and find an "average orthog onal projector" (AOP) wi th an opti- 
mized rank. Our approach is similar to the approach in lCrone and Crosbyl (119951 ). The idea 
is to find the AOP which is, on the average, closest to individual orthogonal projectors with 
respect to some distance criterium. 

The paper is organised as follows. In Sect ion [2] we discuss su b space s and propose a gener- 



alization of the Crone and Crosby distance. I Crone and Crosbyl ( 119951 ) considered subspaces 



of equal dimensions, whereas our weighted distance allows subspaces of different dimensions. 
Some natural choices of weights are given. Furthermore, the concept of averages of sub- 
spaces is discussed. In Section [3] the performance of the weighted distance and the different 
AOP's is evaluated in two unsupervised dimension reduction applications and one supervised 
dimension reduction simulation study. The paper ends with some final remarks. 
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2 Subspaces and distances between subspaces 
2.1 Subspaces with the same dimension k 

We first consider linear subspaces in MP with a fixed dimension k, 1 < k < p. A linear 
subspace and the distances between the subspaces can be defined in several ways. 

1. The subspace is defined as a linear subspace spanned by the linearly independent 
columns of a. p x k matrix B, that is, Sb = {Ba : a G M'^}. This definition based 
on a matrix 5 is a bit obscure in the sense that Sb = Sba for all full-rank k x k 
matrices A. According to this definition, the same subspace can in fact be fixed by any 
member in a set of matrices equivalent to B, 

{BA : A is a full-rank k x k matrix}. 

The non-uniqueness of B may cause technical problems in the estimation of a subspace. 
Consider two pxk matrices Bi and B2 with rank k. Then a measure of distance between 
subspaces spanned by Bi and B2 can be defined as k — J2^=i Pi = ^ — trf-R^-R) where 



Pi, . . . , pI are the squared canonical correlations between Bi and B2 ( lHotellinglll936l ) 
and R = {B[Bi)-^/'^B[B2{B'^B2)~^/'^ . Note that if Bi and B2 are equivalent then the 
squared canonical correlations are all 1. 

2. The subspace is defined as a linear subspace spanned by the orthonormal columns of 
a. p X k matrix U. Note that, starting with 5, one can choose U = B{B'B)-^/^ for 
this second definition. Unfortunately, the definition is still obscure as Su = Suv for all 
orthonormal k x k matrices V, and the same subspace is given by any matrix in the 
class of equivalent orthonormal matrices 

{UV : V is an orthonormal k x k matrix}. 

The principal angles 6i G [0, 7r/2] between the subspaces Ui and U2 with corresponding 
fc-variate direction vectors Ui and Vi i = 1, . . . , k, are recursively defined by maximizing 
u[{U[U2)vi subject to the constraints u'^Ui = v[vi = 1, and u'jUj = v'^Vj = 0, j = 1, . . . , i — 
1. The ith principal angle is then cos = u[{U[U2)vi, i = l,...,k, and a measure of 
distance between the subspaces may be obtained as A; — J2i=i cos^ 9i = k — "^i^iiu'^ViY . 
It is easy to see that it equals to A; — J2i=i Pi- 

3. The subspace is defined as the linear subspace given by an orthogonal projector P, that 
is, a p X p transformation matrix P such that 

(xi — Pxi) ± Px2 for all Xi, X2 G W which is equivalent to P = P' = P^. 

Matrix P provides a unique way to fix the subspace Sp = {Px : x G W}. Note that, 
starting from B, one can define P = Pb = B{B'B)^^B' in a unique way. Starting from 
U gives similarly P = Pu = UU'. The squared distance between the subspaces given 
by two orthogonal projectors Pi and P2 may then be defined as the matrix (Frobenius) 
norm 

k k 

||Pi - P2II' = 2{k - tr(PiP2)) = 2{k - ^cos'^O = 2(/c - J^p-)- 

i=l i=l 
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Crone and Crosbyl ( 1l995l ) use 



D(Pi, P2) = {k- tr{P^P2)f'^ = ||Pi - P2II 

as a distance between two fc- dimensional subspaces oiW given by orthogonal projectors 
Pi and P2. It is then easy to see that < D'^{Pi,P2) < min{k,p — k} and that the 
distance obeys the triangular inequality D{Pi,P^) < D{Pi,P2) + D{P2,P-i) for any 
orthogonal projectors Pi, P2 and P3. 



2.2 Subspaces with arbitrary dimensions 

Assume next that the ranks of the projection matrices Pi and P2 are ki and ^2, respectively, 
where fci, /c2 = 0, For completeness of the theory, we also accept projection matrices P = 
with rank = 0. As ||Pi — P2IP > \ki — ^2!, one possible extension of the above distance is 

D(Pi,P2) = ^ [11^1 - ^2||' -\ki- k2\]^'^ . Then < D\P,,P2) < rmn{k,, k2,p-k,,p-k2} 
but, unfortunately, the triangular inequa lity is not true fo r this distance. We therefore 
consider other extensions of the metric by Crone and Crosbyl f 19951 ) . 

Let w{0), . . . , w{p) be positive weights attached to dimensions 0, . . . ,p. (We will later see 
that the choice of w{0) is irrelevant for the theory.) We then give the following definition. 

Definition 2.1 A weighted distance between subspaces Pi and P2 with ranks ki and k2 is 
given by 

Dl{P,,P2) = ^\\wik,)P,~wik2)P2\\'. (2.1) 

The weights are used to make the orthogonal projectors Pi and P2 with different ranks more 
comparable in some sense. As the distance Dw{Pi, P2) is based on the matrix (Frobenius) 
norm, (i) P'^(Pi,P2) > 0, (ii) P'^(Pi,P2) = if and only if Pi = P2, (iii) P'^(Pi,P2) = 
Dw{P2, Pi), and (iv) Dw{Pi, P3) < D^^Pi, P2) + P'tt,(P2, P3), and we have the following result. 

Proposition 2.1 For all weight functions w, Dw{Pi,P2) is a metric in the space of orthog- 
onal projectors, and the strict lower and upper bounds of Dlj{Pi, P2) for the dimensions ki 
and k2 are 

m{ki, k2)-w{ki)w{k2) min{A;i, k2} < -D^(Pi, P2) < m{ki, k2)+w{ki)w{k2) mm{p-ki-k2, 0} 
where 

m(fcl,fc2) = "'^^^^^^ + "'^^^^^^ 
Some interesting choices of the weights are, for A; > 0, 

(a) Wa{k) = 1, (6) Wb{k) = and (c) Wc{k) = 

Weights in (a) give the distance by Crone and Crosbyl ( 1995 ). Weights in (6) and (c) stan- 
dardize the matrices so that tr{w{ki)Pi) = 1 and ||ty(fci)Pi|| = 1, respectively, if ki > 0. It is 
remarkable that 

tr(PiP2) 



Di(P,,P2) = l- 



^tr{Pi)tr{P2 
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where 



tr{PiP2) 



f ec(Pi)'f ec(P2) 



y/tr{Pi)tr{P2) ^vec{Pi)'vec{Pi)^vec{P2)'vec{P2) 

is a correlation between vectorized Pi and P2. 

Proposition 12.11 implies that, for nonzero ki and /c2, the distances D^(Pi,P2) get any 
values on the closed intervals 
'1,. . , 1 

2 
'1 



(c) 



\ki - k2 

,-1 



-{ki + ^2) + niin{p - ki - ^2, 0} 



k^^ — k2 ^|,-(A;i ^ + k2 ^) + k^^k2^ in.m{p — ki — k2,0} , and 



mm{ky^ k2 k^ ^^"^k^"^}, 1 + k^ ^^"^k^ min{p — k\ — k2, 0} 



If ki = 0, for example, then D^(Pi, P2) is simply 1(7^(^2)^2/2. Recall that, for all three choices 
of weights, the distance is zero only if Pi = P2 (and ki = /c2). For weights Wa, the largest 
possible value for D^(Pi, P2) is p/2 and it is obtained if and only if Pi and P2 are orthogonal 
and Pi + P2 = Ip (i.e., k, + k2=p). For weights Wb, Dl{Pi, P2) < 1, and Dl{Pi, P2) = 1 if 
and only if Pi and P2 are orthogonal and fci = ^2 = 1- Finally, for weights Wc, the maximum 
value D.u}{Pi, P2) = 1 for fci, ^2 7^ is attained as soon as Pi and P2 are orthogonal and 
ki + k2 < p. 

The following two special cases illustrate the differences between the three distances. 
1. First, consider the case when Sp-^ G Sp2. Then naturally tr{PiP2) = tr(Pi) = ki and 

w^{ki)ki + w^{k2)k2 



w{ki)w{k2)ki 



and therefore, for ^2 7^ and with A = ki/k2 



DISP1.P2) 

<(A,P2) 
DISP1.P2) 



|(1-A), 
^(1-A), and 

1 - yx. 



One can see that /^^^(Pi, P2) depends only on the ratio between ki and ^2, which can 
be seen as a nice feature. D^^(Pi, P2) and D^^(Pi, P2) however depend additionally on 
the actual values of ki and k2- 



2. Second, consider the case when Sp^ and Sp^ are orthogonal, that is, when tr{PiP2) 
Then 

2.p _ W\ki)ki + W^k2)k2 

J-'wK^i^^V - ^ 

and therefore, for nonzero ki and /c2, 

1 

2 
1 

■w^kPi^P^) = 2 



0. 



DISP1.P2) 



Al(Pl,P2 



{ki + k2), 

1 1 

ki k2 



- \ z — h — , and 



D^JP^P^) = 1. 
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It is natural to think subspaces that are orthogonal to each other are furthest apart 
possible. This information is apparent in D'^^{Pi, P2). However, interpreting both 
Dl^^^Pi, P2) and D^^(Pi,P2) is again more difficult since they depend on the actual 
values of ki and ^2. 

2.3 Averages of subspaces with arbitrary dimensions 

Consider orthogonal projectors Pi, ... , Pm with ranks ki, . . . , km- To combine the projectors 
we give the following 

Definition 2.2 The average orthogonal projector (AOP) P^ based on weights w{0), . . . , w{p) 
is an orthogonal projector that minimizes the objective function 

^ m 

m ^-^ 

1=1 

To find the AOP, we can use the following result. 
Lemma 2.1 The AOP P^ maximizes the function 

D{P) = w{k)tr{P^P) - ^w\k)k, 

where 

^ m 

Pn, = -y2w{ki)Pi 

i=l 

is a regular average of weighted projectors, and k is the rank of P. 

Naturally, P^ is symmetric and non-negative definite, but not a projector anymore. In 
the following derivations, we need its eigenvector and eigenvalue decomposition 

p 

P^ = UAU' = XiUiu'i 

i=l 

where Ai > . . . > Ap > and Ui is the eigenvector corresponding to the eigenvalue A,. Recall 
that the eigenvectors are uniquely defined only for eigenvalues that are distinct from other 
eigenvalues. Using the Lemma [271] and the eigenvector and eigenvalue decomposition P^,, our 
main result easily follows. 

Proposition 2.2 The rank k of the AOP P^ maximizes the function 

k ^ 
f^{k)=w{k)(Y\i)I{k>{])--w\k)k, k = 0,...,p, 

i=l 

where Ai > . . . > Ap > are the eigenvalues of Pw Moreover, 

k 

P^ = /(A;>0)-^m,m', 

j=i 

where Ui, . . . ,Uk are the eigenvalues corresponding to eigenvalues Ai, . . . , A^. 
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Note that the calculation of the AOP Py^ is easy, only the eigenvalues and eigenvectors of 
Pw are needed. The AOP P^ is not always unique. This happens for example if the rank of 
an AOP is k and A^+i = A^. Consider now the three weight functions 

(a) Wa(k) = 1, (b) Wh(.k) = — , and (c) wJk) = —=, for k > 0. 

k y/k 

The function for these three weight functions is, for A; > 0, 

Note that /^(O) = for all weights w. To find local maxima for these functions, one can 
then use the results 

(a) : Uk + l)>Uk) 

(b) : Uk + l)>Uk) 

(c) : Uk + l)>Uk) 
for fc = l,...,p — 1. 

Note that for (a), fw{k) is a concave function and the global maximum is simply the 
largest k with the eigenvalue > |. The functions in (6) and (c) are not concave, however, 
and the global maximum is found by computing all the values fw{k), k = 0, ...,p. 



Ajfc+i > -, 



^ Afe+i > ^ Ai + . . . + A 



and 



^ Afe+i > 



k + 1 



1 (Ai + . . . + Afc) 



3 Applications 



In this section we discuss the performance of the averages of orthogonal projectors (AOP) for 
three different dimension reduction problems. The orthogonal projectors and their combina- 
tions aim for different targets in different applications. Each problem with natural projectors 
will be first shortly introduced, and then the performance of AOP is demonstrated using simu- 
lation studies. The computations in this section are done using R (IR Development Core Team 



pcaPP (IFilzmoser et al.ll2012l ) and robustbase (iRousseeuw et al. 



2012 ) b y mainly using the pa ckages dr ( Weisber^2002ll. MNM (Nordhausen and Oja 2011 ) 



2012h . 



3.1 Principal component Analysis 

Classical principal component analysis (PCA) may be based on the eigenvector and eigenvalue 
decomposition of the covariance matrix of a p-variate random vector x, that is, on 

p 

cov{x) = UAU' = \iUiu[ 

i=l 

where Ai > ...Ap > are the ordered eigenvalues and Ui, ...,Up are the corresponding eigen- 
vectors. Orthogonal projector Pcov = Yli=i '^i'^'i then projects p-variate observations to the 
fc-variate subset with maximum variation. It is unique if A^+i > A^. 
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Let Fx be the cumulative distribution function of x. A p x p matrix valued functional 
S{F) is a scatter matrix if S{F) is a non-negative definite and symmetric matrix with the 
affine equivariance property 

S{Fjix+b) = AS{Fx)A' for all full-rank p x p matrices A and all p-vectors b. 

It is remarkable that, if x has an elliptic distribution then the ordered eigenvectors of S{Fx) 
are those of cov{x). Therefore, in the elliptic case, any scatter matrix can be used to find P = 
Yl'i=i '^i'^'i ^he matrix P is a well-defined population quantity even if the second moments 
(and the covariance matrix) do not exist. Naturally, the sample statistics corresponding to 
different scatter matrices then have different statistical (efficiency and robustness) properties. 
For a fixed value of /c, one can then try to "average" these different PCA approaches to get 
a compromise estimate. 

We next illustrate the performance of the AOP in the following simple scenario. Let 
first X ~ Nq{^,A) a = (iia5'(9, 7, 5, 1, 1, 1). We choose = 3 and wish to estimate Pcov = 
diag{l, 1, 1,0,0,0). Let then Xi, be a random sample from x ~ A^6(0, A), and find an 

estimate PcSv, an orthogonal projector with rank k = 3 obtained from the sample covariance 
matrix. This estimate is then combined with three robust estimates, namely. 



Pxyier that is bascd on Tyler's shape matrix ( iTylerl 119871 ) with the affine equivariant version 



of sp atial median as a multivariate location estimate ( iHettmansperger and Randies 



20021 ). 



Pmc b tha t is based on the minimum covariance determinant (MCD) estimator (IRousseeuw 



7 /7 Ilia I 
Il986h . 



Ppp that is based on projection pursuit (PP) approach for PCA with the rn edian absolute 



deviation (mad) criterion as suggested in ICroux and Ruiz-Gazeru (120051 ). 



In the simulations, xi,...,x„ was a random sample from A'6(0,A) with n = 400, and the 
sampling was repeated 1000 times. As ki = ... = km = k = 3 is fixed , we use only Wa as the 
weight function. The average squared distances Df^^ between the four projector estimates, 
their AOP P^^, and Pcov are shown in Tabled! A similar simulation study was conducted 
but with observations coming from a heavy-tailed elliptical t2 distribution with 9, 7, 5, 1, 1, 1 
as proportional eigenvalues. Note that the regular scatter matrix does not exist in this case 
but the true projection matrix is still well defined. 



Table 1: Average squared distances between the four projector estimates, their AOP 
Pw^, and true Pcov For all projectors, rank A; = 3. The averages are based on 1000 random 
samples of size n = 400 from A'"6(0, A). 





^ cov 


Pxyler 


Pmcd 


Ppp 


p 


p 

^ cov 


p— 

^ cov 


0.000 


0.002 


0.002 


0.056 


0.004 


0.005 


Pxyler 


0.002 


0.000 


0.001 


0.054 


0.004 


0.007 


Pmcd 


0.002 


0.001 


0.000 


0.055 


0.004 


0.007 


Ppp 


0.056 


0.054 


0.055 


0.000 


0.031 


0.061 


P 

Wa 


0.004 


0.004 


0.004 


0.031 


0.000 


0.009 



The results in the multivariate normal case show, as expected, that the projector estimate 
based on the covariance matrix is the best one here. Also the average projector performs very 
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well although it combines information coming from much worse Ppp. In the ^2 distribution 
case, traditional PcSu fails but the average projector is still performing well (see Table 
Recall the second moments and Pcov do not exist in this case. 

Table 2: Average squared distances D"^ between the four projector estimates, their AOP 
P^^, and true Pcov For all projectors, rank k = 3. The averages are based on 1000 random 
samples of size n = 400 from t2 with eigenvalues 9, 7, 5, 1, 1, 1. 





P— 

^ cov 


Pryler 


Pmcd 


Ppp 


p 


P 

cov 


cov 


0.000 


0.110 


0.122 


0.169 


0.074 


0.114 


Pxyler 


0.110 


0.000 


0.006 


0.063 


0.010 


0.007 


Pmcd 


0.122 


0.006 


0.000 


0.066 


0.014 


0.012 


Ppp 


0.169 


0.063 


0.066 


0.000 


0.042 


0.070 


p 


0.074 


0.010 


0.014 


0.042 


0.000 


0.016 



3.2 Averaging one-dimensional PP projectors 



In the previous section we used projection pursuit (PP) approach for principal compo- 
nent analysis. PP is a much more general technique, however, and there are many other 
types of indices than just measures of variation to define " intere sting" one-dimensional di- 
rections. PP actually dates back to iFriedman and Tukeyl (jl974j ) and usually one searches 
for nongaussian direction s . Fo r a recent review of the existing indices, see for example 



Rodriguez-Martinez et al.l ( l2010l ). A major challenge in PP is that it is computationally dif- 
ficult to find the direction which globally maximizes the index and that there are usually 
several local maxima. However, since the l ocal m axima may be also of interest, one possible 
strategy, as detailed in iRuiz-Gazen et al.l (|2010[ ). is to run the algorithm many times using 
different initializations. With this strategy, the user has many projectors of rank one but 
many of them are usually redundant. So, it is of particular interest to summarize all these 
projectors in order to extract the directions that are useful and unique. It means that, in 
that case, one is interested in an average projector of projectors with rank one that may have 
a higher rank. 

To demonstrate the interest of AOP in the context of PP, we choose the deflation-based 



fastICA method (jHyvarineru Il999l ) as an example since it is well-understood and computa- 
tionally quite simple. While deflation-based fastICA is originally developed in the context 
of independent component analysis (ICA), it can be seen as a traditional PP approach 
when only one direction is extracted. For a random variable x with the standardized 
version z = cov{x)~^^'^{x — E{x)), deflation-based fastICA maximizes a measure of non- 
gaussianity of the form \E{G{u'z))\, under the constraint that u'u = 1, where G is a se- 
lected twice differentiable nonlinear nonquadratic function with G{0) = 0. The final PP 
direction is then {u'cov{x)~^u)~^^'^cov(x)~^^'^u, and the corresponding orthogonal projector 



In our simulations, 



we use four common choices 

2 



is {u'cov{x)~^u)~^cov{x)~^^'^uu'cov{x)^^^'^ 

of G{u) with derivative functions g{u): (i) , (ii) tanh(-u), (iii) -u exp(-u^/2), and (iv) u'^. If 
there are more than one non-gaussian direction in the data, the direction to be found depends 
heavily on the initial value of the algorithm, see e.g. iNordhausen et al.l (120111 ). 

In our simulation study, we choose a 10-variate x = (xi, . . 
variables are mixtures of gaussian distributions and Xj ~ N{0, 



1), for 



where the first three 
4 10. More 
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precisely, Xi = -^{ViVi + (1 - Vi)y2 - 2) with pi ~ 5m(l,0.5), yi ~ A^(0, 1) and ?/2 ~ 
iV(4,l) ; X2 = 7=(P2y3 + (1 -P2)2/4 - 2.1) with p2 ~ 5zn(l,0.3), 7/3 ~ iV(0, 1) and 
7/4 ~ Ar(3,l), and X3 = 7=(P3Z/5 + (1 - ^3)1/6 - 2) with ~ 5m(l,0.4), 7/5 ~ Ar(0,9) 
and ?/6 ~ ^(8) 9). We generated 1000 random samples of sizes n = 200 from the 10-variate 
distribution described above. For each sample, we found 100 one-dimensional PP directions 
(4 choices of G, 25 random initial values for the algorithm for each choice of G). For each 
sample, 100 PP projectors were then averaged using each of the three weight functions Wa, 
Wb and Wc- The average projector should in this setting then be close to the projector 
Ptrue = diag{l, 1, 1, 0, . . . , 0) with rank 3 that picks the three non-gaussian components of 
the data. 




Pw, P^B P^, 



Figure 1: Relative frequencies of the estimated ranks of the AOP using the weight functions 
Wa, Wb and Wc- 

Figure [1] shows the relative frequencies of the ranks of the AOPs obtained with the three 
weight functions in 1000 repetitions. Clearly the weight function Wa is not appropriate in 
this application because fci = /c2 = ■ . ■ = /cm = 1 implies that Yl^i = ^ with A, > 0, which 
means that there cannot be more than one eigenvalue larger than 1/2 and, consequently, the 
rank k equals zero or one. With weight functions Wb and Wc, the correct rank 3 is obtained 
in 82.6% and 68.3% of the runs, respectively. It is also hoped that the AOPs are close to the 
true projector Ptme- To evaluate that, we found that the the average distances Z)^^ between 
Pw^ and Ptrue, between P^,^ and Ptme, and between Py^^ and Ptme were 1.005, 0.122, and 
0.199, respectively. The same numbers for the distances based on Wb and Wc are 0.335, 0.018, 
and 0.035, and 0.425, 0.043, and 0.0736, respectively. Notice that, for all distances, the AOP 
Pw^ is closest on average to the true value Ptme- 

3.3 Supervised dimension reduction 

In the PGA application, we used the same k = 3 for orthogonal projectors and their AOP. 
In the PP application, the rank of the orthogonal projectors was taken as one while the rank 
of their AOP was not fixed. However, for many dimension reduction methods, the ranks of 
the individual orthogonal projectors are not fixed but also estimated from the data, and the 
ranks may differ from one method to another. We now look at this scenario in the framework 
of supervised dimension reduction. 
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In supervised dimension reduction, one often assumes that a response variable y and the 
p- vector X are related through 



y = fib'^x,...,!)^ 



with an unknown function / and an unknown error variable e. The goal of supervised 
dimension reduction is to estimate the value of k and the matrix B = {bi, . . . ,bk) to obtain 
Pb with rank k. Hence, for supervised dimension reduction, the joint distribution of y and 
X is of interest and, for the matrix B, it holds that y X x\B'x. 

Many supervised dimension reductions are suggested in the literature and their perfor- 
mances often strongly depend on the unknown function /. The well-known sliced inverse 
regression (SIR) for example may not find directions with nonlinear dependencies while, on 
the other hand, principal Hessian directions (PHD) cannot find linear relationships. Hence, 
when using supervised dimension reduction methods in practice, the estimated rank k and 
the corresponding projector might differ considerably depending on the method. We propose 
to use the AOP in order to summarize in an efficient way the information brought by the 
complementary estimation strategies. 

In our example we generate data sets from the following three models. 

Ml: y = b[^x + [b'^^xf + ae 
M2: y = 5 + 621a; + ae 
M3: y = {b'^^xf + ae 

where x ~ iVio(0, Jio), e ~ A^(0, 1), a = 0.5 and 6jj's are all 10-dimensional row vectors 

611 = (2,3,0,...,0)'and6i2 = (0,0,-5,0,...,0)', 
621 = (1,1,1,0,...,0)', and 
631 = (1,0,0,0,. ..,0)' 

Hence k = 2 for model Ml and k = 1 for models M2-M3. In each case, we generated 100 
samples of size 400. 

In our illustration, we use supervised dimension reduction methods implemented in the 
dr package that provide both the estimate of k and the orthogonal projector estimate with 
the same rank k. The estimation strategies are then (i) sliced inverse regression (SIR), (ii) 
sliced variance estimation (SAVE), (iii) inverse regression estimation (IRE), and three types 
of principal hessian directions (PHD), namely, (iv) response based principal hessian directions 
(PHDY), (v) residual based principal hessian directions (PHDR), and (vi) the so called q- 
based pri n cipal hessian directions (PHDQ). For details about these estimation methods, see 
Weisberg and references therein. We also add here (vii) PCA with k chosen simply 



as the number of eigenvalues larger than 1. Naturally, PCA ignores y and is therefore not 
supervised. (Its use could be motivated by the aim to avoid directions with small variation. 
In our case it just provides random projectors.) 

In the following we want to compare the seven methods above and their AOPs based on 
the weight functions and Wc- The use of Wa is not reasonable with varying k. We consider 
here the following four AOPs. 

AOPl: The AOP using Wb with fixed and true k. 



A0P2: The AOP using Wc with fixed and true k. 
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A0P3: The AOP using Wb with estimated k. 
A0P4: The AOP using Wc with estimated k. 



■B 



SIR SAVEPHDRPHDYPHDQ IRE PCA AOPl A0P2 AOP3 AOP4 



a 



SIR SAVEPHDRPHDYPHDQ IRE PCA AOPl AOP2 AOP3 AOP4 



Figure 2: Boxplots of the observed (left panel) and Z)^^ (right panel) distances between 
the true and estimated projection matrices when the observations come from the model Ml 
with k = 2. 



Some simulation results are collected in Figures [2]- HI The figures show the boxplots for 
the observed and distances between the true orthogonal projector and the projector 
estimates coming from different dimension reduction approaches. Consider first the behavior 
of the estimates in the model Ml with k = 2 (see Figure |2]). The performances of SIR, 
SAVE and PHD estimates seem to be very similar and they usually find only one direction. 
(For example, SIR finds only the component with linear dependence, and SAVE only the 
component of quadratic dependence.) The same seems to be true with IRE but with more 
varying estimates. Recall that, if iSp^ C Sp^ and ki = 1 and k2 = 2, then A = ki/k2 = 0.5, 
and the average distances of SIR, SAVE and PHD estimates tend to be close to 

D2^(Pi,P2) = ^(l-A) = 0.25 and Dl^{P,, P2) = 1 - ^/X = 0.293, 

respectively. The AOP estimates then nicely pick up the two dimensions and clearly out- 
perform other estimates. Note that there is no big difference between AOP estimates with 
known k and AOP estimates with estimated k. The AOP estimate based on Wc seems to be 
better. PCA has a poor performance as expected. 

Figure [3] shows the results when the observations come from the model M2. The model 
with linear dependence only is then of course the model where SIR is the best one. IRE 
also performs quite well but, for most samples, SAVE and PHD approaches do not find any 
solution at all. Recall that, if fci = and k2 = I then P'^^(Pi, P2) = P'^,(Pi, P2) = 0.5. The 
AOP estimates seem often to pick up the correct subspace, and there is no real difference 
between the Wb and Wc estimates. This time, the AOP estimates with known dimension k = 1 
seem to perform better than the AOM estimates with estimated k. 

Figure m gives the results for the model M3 with k = 1 and quadratic dependence. SAVE 
and PHD approaches work very well, and SIR and IRE completely fail in this case. Again, 
all AOP estimates neglect the bad estimates and pick up nicely the correct one direction. As 
in the other cases, PCA provides a random reference method with a bad performance indeed. 
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SIR SAVE PHDRPHDYPHDQ IRE PCA A0P1 A0P2 A0P3 A0P4 SIR SAVE PHDRPHDYPHDQ IRE PCA A0P1 A0P2 A0P3 A0P4 



Figure 3: Boxplots of the observed D^^^ (left panel) and L>^^ (right panel) distances between 
the true and estimated projection matrices when the observations come from the model M2 
with k = 1. 



T -9- 



SIR SAVE PHDRPHDYPHDQ IRE PCA A0P1 A0P2 A0P3 A0P4 



— -8- 



SIR SAVE PHDRPHDYPHDQ IRE PCA A0P1 AQP2 A0P3 A0P4 



Figure 4: Boxplots of the observed D^^^ (left panel) and (right panel) distances between 
the true and estimated projection matrices when the observations come from the model M3 
with k — 1. 
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4 Final comments 

Dimension reduction and subspace estimation is a topic with increasing relevance since mod- 
ern datasets become larger and larger. Different approaches have different shortcomings and 
combining the results coming from different approaches might give a better total overview. 
In this paper, we propose a generalization of the Crone and Crosby distance for the orthogo- 
nal projectors, a weighted distance that allows to combine subspaces of different dimensions. 
Some natural choices of weights are considered in detail. The performance of three weighted 
distances and the combining approach is illustrated via simulations which show that each of 
them has its own justification depending on the problem at hand. Similar to other areas of 
statistics, this kind of "model averaging" seems to be a way to combine information from 
competing estimates and to give a better idea of the true model at hand. 
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Lemma .1 For two pxp orthogonal projectors Pi and P2 with ranks ki and k2, respectively, 

max{p — ki — k2, 0} < tr{PiP2) < min{A;i, /C2}. 

Proof First note that Pi = UiU[ and P2 = U2U2 where Ui has ki orthonormal columns and 
U2 has k2 orthonormal columns. Then tr{PiP2) = \\U[U2\\'^ > 0. As tr{PiP2) + tr{Pi{Ip - 
P2)) = tr(Pi) = ki and tr{PiP2) + tr{{Ip - ^1)^2) = tr{P2) = k2 one can conclude that 
tr{PiP2) < min{A;i, A;2}. Similarly, tr(Pi(/p — P2)) < mm{ki.p — k2} and therefore tr (Pi P2) = 
ki — tr{Pi{Ip — P2)) > ki — in.m{ki,p — k2} = maxj/o-i + k2 — p, 0}, and the result follows. 

Note also that the lower and upper bounds in the above lemma are fixed. The upper bound 
is obtained with the choices 



0.9-2. 



1-22. 
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1=1 



1=1 



and the lower bound with the choices 




1=1 



i=p—k2+l 



where ej is a p- vector with the ith component one and other components zero. 
Proof of Proposition 2.1 One easily sees that 



= m{ki, /C2) - w{ki)w{k2)tr{PiP2), 
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and the proof follows from Lemma [T] 
Proof of Lemma 2.1 As shown before, 

Dl{P,, P) = ^w\ki)ki + ^w\k)k - w{h)w{k)tr{PiP). 

Then 

The first term in the latest sum does not depend on P or k. Thus, cr^(-P) is minimized when 
w{k)tr[PwP) — ^w'^{k)k is maximized. 

Proof of Proposition 2.2 The AOP Pw maximizes 

D{P) = w{k)tr{P^P) - ]^w'^{k)k, 

where k is the rank of P. Assume first that > is fixed and P = VV where V has k 
orthonormal columns. Then D{P) is maximized as soon as tr{P^P) is maximized. Then, 
as Pw = Y7i=i t^{PwP) = tr{Pii,VV') = tr{V'PwV) is maximized if \^ = ...,Uk), 

and the maximum value is Yli=i ^i- -^^^ fixed k > 0, the maximum value of D[P) is then 
w{k) EjLi Xi - \w'^{k)k, and ^(0) = 0. The result follows. 



